1. Project Question
What can the locations of Starbucks stores in Ohio tell you about population density?
1.1 Purpose of this project
- to show how I check and clean data
- to show how I can visualize data through maps
- to show how to get a grasp of census data for a state by county
- to show that I can write R Markdown in a clean, presentable way
- to show how I can use git to save different versions of a project and share it
BONUS: if I can turn this into a Shiny app where you choose a state and see its population, age, and income data with an overlay of Starbucks locations, that would be great.
1.2 Libraries
library(rmarkdown) #to create an html doc for this project
library(yaml) #for rmarkdown
library(rmdformats) #for markdown theme
library(readr) #to clean and explore data
library(stringr) #data exploration
library(tidyverse) #to use dplyr, tidyr, ggplot2
library(maps) #for county map lines
library(plotly) #for hover over map
library(RColorBrewer) #for map color
1.3 Data
- Starbucks locations from kaggle
- Census data by county. 2020 data isn’t yet available as of Dec 2021, so I am using ACS 1-year estimates for 2019, which can be downloaded by state and county
- population data by age here
For info on comparing decennial census data to the ACS: https://www.census.gov/content/dam/Census/library/publications/2020/acs/acs_general_handbook_2020_ch09.pdf
2. Starbucks Data
2.1 Data Cleaning Process
2.1.a Load the data
starbucks <- read_csv("~/Dropbox/Data Science Classes/R data visualization/Final Project/starbucks.csv")
view(starbucks) #to open up the file in a new tab
2.1.b Review columns and rows
I use str() to check which columns I want to use and that they are the correct type.
The output below tells me there are 28,289 rows and 17 columns.
str(starbucks)
## spec_tbl_df [28,289 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:28289] 0 1 2 3 4 5 6 7 8 9 ...
## $ storeNumber : chr [1:28289] "34638-85784" "32141-267986" "15035-155445" "49646-268445" ...
## $ countryCode : chr [1:28289] "HK" "HK" "HK" "HK" ...
## $ ownershipTypeCode : chr [1:28289] "LS" "LS" "LS" "LS" ...
## $ schedule : chr [1:28289] "[{'dayName': 'Today', 'hours': '8:30 AM to 10:30 PM', 'open': True, 'holiday': ''}, {'dayName': 'Tomorrow', 'ho"| __truncated__ "[{'dayName': 'Today', 'hours': '7:30 AM to 10:00 PM', 'open': True, 'holiday': ''}, {'dayName': 'Tomorrow', 'ho"| __truncated__ "[{'dayName': 'Today', 'hours': '8:00 AM to 10:00 PM', 'open': True, 'holiday': ''}, {'dayName': 'Tomorrow', 'ho"| __truncated__ "[{'dayName': 'Today', 'hours': '8:00 AM to 10:00 PM', 'open': True, 'holiday': ''}, {'dayName': 'Tomorrow', 'ho"| __truncated__ ...
## $ slug : chr [1:28289] "荷李活廣場-level-2-plaza-hollywood-diamond-hill-hong-kong-91-" "黃大仙中心南館-shop-no-g-3-b-ground-floor-wong-tai-sin-hong-kong-" "mikiki-shop-no-g-01-ground-floor-mikiki-638-kowloon-91-hk" "九龍城廣場-九龍九龍城廣場-地下低層-lg-10-舖-香港-91-hk" ...
## $ latitude : num [1:28289] 22.3 22.3 22.3 22.3 22.3 ...
## $ longitude : num [1:28289] 114 114 114 114 114 ...
## $ streetAddressLine1 : chr [1:28289] "Level 2, Plaza Hollywood, Diamond Hill," "Shop No. G3B, Ground Floor, Wong Tai Sin" "Shop No. G01, Ground Floor, Mikiki 638" "九龍九龍城廣場 地下低層LG10舖" ...
## $ streetAddressLine2 : chr [1:28289] "Kowloon" "Shopping Centre, Wong Tai Sin, Kowloon" "Prince Edward Road East," NA ...
## $ streetAddressLine3 : chr [1:28289] NA NA NA NA ...
## $ city : chr [1:28289] "Hong Kong" "Hong Kong" "Kowloon" "香港" ...
## $ countrySubdivisionCode: chr [1:28289] "91" "91" "91" "91" ...
## $ postalCode : chr [1:28289] NA NA NA NA ...
## $ currentTimeOffset : num [1:28289] 480 480 480 480 480 480 480 480 480 480 ...
## $ windowsTimeZoneId : chr [1:28289] "China Standard Time" "China Standard Time" "China Standard Time" "China Standard Time" ...
## $ olsonTimeZoneId : chr [1:28289] "GMT+08:00 Asia/Hong_Kong" "GMT+08:00 Asia/Hong_Kong" "GMT+08:00 Asia/Hong_Kong" "GMT+08:00 Asia/Hong_Kong" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. storeNumber = col_character(),
## .. countryCode = col_character(),
## .. ownershipTypeCode = col_character(),
## .. schedule = col_character(),
## .. slug = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. streetAddressLine1 = col_character(),
## .. streetAddressLine2 = col_character(),
## .. streetAddressLine3 = col_character(),
## .. city = col_character(),
## .. countrySubdivisionCode = col_character(),
## .. postalCode = col_character(),
## .. currentTimeOffset = col_double(),
## .. windowsTimeZoneId = col_character(),
## .. olsonTimeZoneId = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Based on this output I most likely want to use:
- storeNumber as an index
- countryCode to filter by the US
- latitude for mapping the location
- longitude same as above
- a column that has state (if I’m lucky)
2.1.c Filter out only US locations
I can’t use all columns as is. I need to filter by countryCode to pull only US locations, then see how I can filter by state.
To filter by countryCode I want to see the unique values to find the one for the US. I also sort it in descending order so that I can find the U’s easily.
starbucks %>%
distinct(countryCode) %>% #list out unique county codes
arrange(desc(countryCode)) #put in descending order since U is late in the alphabet
## # A tibble: 49 × 1
## countryCode
## <chr>
## 1 US
## 2 TW
## 3 TR
## 4 TH
## 5 SV
## 6 SK
## 7 SG
## 8 SA
## 9 RU
## 10 RO
## # … with 39 more rows
Now I know that US is the countryCode. Obvious, but just had to check.
I can now filter by country, and it looks like countrySubdivisionCode is the state, so I can find Ohio. The output in this markdown file doesn’t show that column, but it shows in my Rmd file.
starbucks %>%
filter(countryCode == "US")
## # A tibble: 15,003 × 17
## ...1 storeNumber countryCode ownershipTypeCode schedule slug latitude
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 861 74256-51879 US LS [{'dayName… safewa… 48.1
## 2 862 79652-104215 US LS [{'dayName… safewa… 48.1
## 3 866 9940-96882 US CO [{'dayName… sequim… 48.1
## 4 867 74255-51779 US LS [{'dayName… safewa… 48.1
## 5 869 64668-298098 US LS [{'dayName… safewa… 48.3
## 6 870 3286-290340 US CO [{'dayName… sr-20-… 48.3
## 7 871 74882-99767 US LS [{'dayName… haggen… 48.3
## 8 872 74267-130085 US LS [{'dayName… safewa… 48.1
## 9 873 13776-105935 US CO [{'dayName… harbor… 48.3
## 10 874 3279-1869 US CO [{'dayName… commer… 48.5
## # … with 14,993 more rows, and 10 more variables: longitude <dbl>,
## # streetAddressLine1 <chr>, streetAddressLine2 <chr>,
## # streetAddressLine3 <chr>, city <chr>, countrySubdivisionCode <chr>,
## # postalCode <chr>, currentTimeOffset <dbl>, windowsTimeZoneId <chr>,
## # olsonTimeZoneId <chr>
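As an aside, dplyr's count() would condense the distinct()/filter() checks and also show how many stores each country has; a minimal sketch on toy data (only the column name matches the real dataset):

```r
library(dplyr)

# Toy stand-in for the full starbucks data frame
toy <- tibble(countryCode = c("US", "US", "HK"))

# One row per country, n = number of stores, largest first
country_counts <- toy %>% count(countryCode, sort = TRUE)
country_counts
```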
2.1.d Pull only Ohio
By looking at the distinct values in countrySubdivisionCode I can see Ohio is abbreviated as “OH”. Obvious, but I need to check just in case.
starbucks %>%
filter(countryCode == "US") %>%
distinct(countrySubdivisionCode) %>%
arrange(countrySubdivisionCode)
## # A tibble: 51 × 1
## countrySubdivisionCode
## <chr>
## 1 AK
## 2 AL
## 3 AR
## 4 AZ
## 5 CA
## 6 CO
## 7 CT
## 8 DC
## 9 DE
## 10 FL
## # … with 41 more rows
2.1.e Pull out US and OH locations and only the columns I want
Now I can filter for “US” and “OH” and pull only the columns I identified above.
Here is the revised list of columns I want:
- storeNumber as an index
- countryCode to filter by the US
- city in case I need to cross-check the lat/long with the city
- latitude for mapping the location
- longitude same as above
- countrySubdivisionCode for the state
While filter() gives me the rows I want, dplyr's select() gives me the columns.
starbucksOH <- starbucks %>% #I'm going to save it as a new dataframe
filter(countryCode == "US" & countrySubdivisionCode == "OH") %>%
select(storeNumber, countryCode, city, latitude, longitude, countrySubdivisionCode)
2.1.f Check the structure to make sure numbers are numbers and text columns are characters
str(starbucksOH)
## tibble [451 × 6] (S3: tbl_df/tbl/data.frame)
## $ storeNumber : chr [1:451] "2681-66702" "72243-24307" "26359-243806" "2355-4430" ...
## $ countryCode : chr [1:451] "US" "US" "US" "US" ...
## $ city : chr [1:451] "Lewis Center" "Columbus" "Columbus" "Columbus" ...
## $ latitude : num [1:451] 40.2 40.1 40.1 40.1 40.1 ...
## $ longitude : num [1:451] -83 -83 -83 -83 -83 ...
## $ countrySubdivisionCode: chr [1:451] "OH" "OH" "OH" "OH" ...
latitude and longitude are numeric while the rest are characters, which is exactly what I want.
There are 451 Starbucks locations in Ohio. As a gut check, Spoon University states there are 375, but their article is undated, so it may be old.
Spoon’s number is about 16.9% lower, which is not too far off from the 2021 data, so I’m going to go with it.
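For reference, the percentage comparison is just:

```r
# Spoon University's 375 vs my 451, relative to my count
diff_pct <- (451 - 375) / 451
round(diff_pct * 100, 1) # about 16.9 percent fewer
```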
2.1.g Check for NA values: where are they, and are they an issue?
In the code below I filter for any rows that have NA in the column named in the parentheses. It looks like there are no NA’s.
starbucksOH %>%
filter(is.na(latitude))
## # A tibble: 0 × 6
## # … with 6 variables: storeNumber <chr>, countryCode <chr>, city <chr>,
## # latitude <dbl>, longitude <dbl>, countrySubdivisionCode <chr>
starbucksOH %>%
filter(is.na(longitude))
## # A tibble: 0 × 6
## # … with 6 variables: storeNumber <chr>, countryCode <chr>, city <chr>,
## # latitude <dbl>, longitude <dbl>, countrySubdivisionCode <chr>
starbucksOH %>%
filter(is.na(city))
## # A tibble: 0 × 6
## # … with 6 variables: storeNumber <chr>, countryCode <chr>, city <chr>,
## # latitude <dbl>, longitude <dbl>, countrySubdivisionCode <chr>
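The three checks above can be collapsed into one using dplyr's if_any(), which keeps any row with an NA in any column; a minimal sketch on toy data:

```r
library(dplyr)

toy <- tibble(city     = c("Columbus", NA, "Dayton"),
              latitude = c(40.1, 41.5, NA))

# Keep rows where any column is NA
rows_with_na <- toy %>% filter(if_any(everything(), is.na))
rows_with_na # 2 rows: one NA city, one NA latitude
```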
2.2 Data manipulation process
I had thought that I would have to add a table with zip codes and states to get a state column, but there is no need since the countrySubdivisionCode column is the state.
3. Census population data
3.1 Data checking process
3.1.a Load the data
populationOH <- read_csv("~/Dropbox/Data Science Classes/R data visualization/Final Project/ohioPopByCounty.csv")
view(populationOH)
3.1.b Check columns
The census population data is estimated from the last decennial census. I know that the YEAR column runs from 1 to 12, and 12 is 2021.
I want the following columns:
- STNAME
- CTYNAME
- YEAR, which is going to be 12
- POPESTIMATE
populationOH %>%
colnames()
## [1] "SUMLEV" "STATE" "COUNTY" "STNAME"
## [5] "CTYNAME" "YEAR" "POPESTIMATE" "POPEST_MALE"
## [9] "POPEST_FEM" "UNDER5_TOT" "UNDER5_MALE" "UNDER5_FEM"
## [13] "AGE513_TOT" "AGE513_MALE" "AGE513_FEM" "AGE1417_TOT"
## [17] "AGE1417_MALE" "AGE1417_FEM" "AGE1824_TOT" "AGE1824_MALE"
## [21] "AGE1824_FEM" "AGE16PLUS_TOT" "AGE16PLUS_MALE" "AGE16PLUS_FEM"
## [25] "AGE18PLUS_TOT" "AGE18PLUS_MALE" "AGE18PLUS_FEM" "AGE1544_TOT"
## [29] "AGE1544_MALE" "AGE1544_FEM" "AGE2544_TOT" "AGE2544_MALE"
## [33] "AGE2544_FEM" "AGE4564_TOT" "AGE4564_MALE" "AGE4564_FEM"
## [37] "AGE65PLUS_TOT" "AGE65PLUS_MALE" "AGE65PLUS_FEM" "AGE04_TOT"
## [41] "AGE04_MALE" "AGE04_FEM" "AGE59_TOT" "AGE59_MALE"
## [45] "AGE59_FEM" "AGE1014_TOT" "AGE1014_MALE" "AGE1014_FEM"
## [49] "AGE1519_TOT" "AGE1519_MALE" "AGE1519_FEM" "AGE2024_TOT"
## [53] "AGE2024_MALE" "AGE2024_FEM" "AGE2529_TOT" "AGE2529_MALE"
## [57] "AGE2529_FEM" "AGE3034_TOT" "AGE3034_MALE" "AGE3034_FEM"
## [61] "AGE3539_TOT" "AGE3539_MALE" "AGE3539_FEM" "AGE4044_TOT"
## [65] "AGE4044_MALE" "AGE4044_FEM" "AGE4549_TOT" "AGE4549_MALE"
## [69] "AGE4549_FEM" "AGE5054_TOT" "AGE5054_MALE" "AGE5054_FEM"
## [73] "AGE5559_TOT" "AGE5559_MALE" "AGE5559_FEM" "AGE6064_TOT"
## [77] "AGE6064_MALE" "AGE6064_FEM" "AGE6569_TOT" "AGE6569_MALE"
## [81] "AGE6569_FEM" "AGE7074_TOT" "AGE7074_MALE" "AGE7074_FEM"
## [85] "AGE7579_TOT" "AGE7579_MALE" "AGE7579_FEM" "AGE8084_TOT"
## [89] "AGE8084_MALE" "AGE8084_FEM" "AGE85PLUS_TOT" "AGE85PLUS_MALE"
## [93] "AGE85PLUS_FEM" "MEDIAN_AGE_TOT" "MEDIAN_AGE_MALE" "MEDIAN_AGE_FEM"
3.1.c Look at data structure
There are a lot of columns splitting out the population data by age and sex.
populationOH %>%
str()
## spec_tbl_df [1,056 × 96] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ SUMLEV : chr [1:1056] "050" "050" "050" "050" ...
## $ STATE : num [1:1056] 39 39 39 39 39 39 39 39 39 39 ...
## $ COUNTY : chr [1:1056] "001" "001" "001" "001" ...
## $ STNAME : chr [1:1056] "Ohio" "Ohio" "Ohio" "Ohio" ...
## $ CTYNAME : chr [1:1056] "Adams County" "Adams County" "Adams County" "Adams County" ...
## $ YEAR : num [1:1056] 1 2 3 4 5 6 7 8 9 10 ...
## $ POPESTIMATE : num [1:1056] 28550 28541 28537 28459 28293 ...
## $ POPEST_MALE : num [1:1056] 14095 14084 14079 14042 13960 ...
## $ POPEST_FEM : num [1:1056] 14455 14457 14458 14417 14333 ...
## $ UNDER5_TOT : num [1:1056] 1924 1924 1913 1856 1792 ...
## $ UNDER5_MALE : num [1:1056] 1001 1001 997 955 916 ...
## $ UNDER5_FEM : num [1:1056] 923 923 916 901 876 836 840 809 793 788 ...
## $ AGE513_TOT : num [1:1056] 3610 3611 3633 3676 3650 ...
## $ AGE513_MALE : num [1:1056] 1793 1789 1806 1858 1863 ...
## $ AGE513_FEM : num [1:1056] 1817 1822 1827 1818 1787 ...
## $ AGE1417_TOT : num [1:1056] 1601 1597 1584 1555 1541 ...
## $ AGE1417_MALE : num [1:1056] 795 793 778 741 736 740 773 779 801 828 ...
## $ AGE1417_FEM : num [1:1056] 806 804 806 814 805 792 837 845 814 794 ...
## $ AGE1824_TOT : num [1:1056] 2260 2256 2263 2255 2233 ...
## $ AGE1824_MALE : num [1:1056] 1131 1127 1134 1133 1096 ...
## $ AGE1824_FEM : num [1:1056] 1129 1129 1129 1122 1137 ...
## $ AGE16PLUS_TOT : num [1:1056] 22243 22234 22232 22177 22065 ...
## $ AGE16PLUS_MALE : num [1:1056] 10916 10910 10903 10875 10813 ...
## $ AGE16PLUS_FEM : num [1:1056] 11327 11324 11329 11302 11252 ...
## $ AGE18PLUS_TOT : num [1:1056] 21415 21409 21407 21372 21310 ...
## $ AGE18PLUS_MALE : num [1:1056] 10506 10501 10498 10488 10445 ...
## $ AGE18PLUS_FEM : num [1:1056] 10909 10908 10909 10884 10865 ...
## $ AGE1544_TOT : num [1:1056] 10471 10463 10450 10282 10134 ...
## $ AGE1544_MALE : num [1:1056] 5202 5198 5187 5101 5009 ...
## $ AGE1544_FEM : num [1:1056] 5269 5265 5263 5181 5125 ...
## $ AGE2544_TOT : num [1:1056] 6980 6979 6969 6856 6761 ...
## $ AGE2544_MALE : num [1:1056] 3459 3460 3453 3408 3365 ...
## $ AGE2544_FEM : num [1:1056] 3521 3519 3516 3448 3396 ...
## $ AGE4564_TOT : num [1:1056] 7945 7945 7948 7976 7899 ...
## $ AGE4564_MALE : num [1:1056] 3978 3975 3969 3970 3936 ...
## $ AGE4564_FEM : num [1:1056] 3967 3970 3979 4006 3963 ...
## $ AGE65PLUS_TOT : num [1:1056] 4230 4229 4227 4285 4417 ...
## $ AGE65PLUS_MALE : num [1:1056] 1938 1939 1942 1977 2048 ...
## $ AGE65PLUS_FEM : num [1:1056] 2292 2290 2285 2308 2369 ...
## $ AGE04_TOT : num [1:1056] 1924 1924 1913 1856 1792 ...
## $ AGE04_MALE : num [1:1056] 1001 1001 997 955 916 ...
## $ AGE04_FEM : num [1:1056] 923 923 916 901 876 836 840 809 793 788 ...
## $ AGE59_TOT : num [1:1056] 2000 2003 2007 2014 2012 ...
## $ AGE59_MALE : num [1:1056] 1038 1037 1040 1057 1049 ...
## $ AGE59_FEM : num [1:1056] 962 966 967 957 963 937 919 935 898 880 ...
## $ AGE1014_TOT : num [1:1056] 1980 1977 1992 2046 2039 ...
## $ AGE1014_MALE : num [1:1056] 938 934 944 982 1002 ...
## $ AGE1014_FEM : num [1:1056] 1042 1043 1048 1064 1037 ...
## $ AGE1519_TOT : num [1:1056] 1964 1959 1938 1842 1752 ...
## $ AGE1519_MALE : num [1:1056] 992 989 975 915 857 845 864 867 887 910 ...
## $ AGE1519_FEM : num [1:1056] 972 970 963 927 895 895 881 919 913 891 ...
## $ AGE2024_TOT : num [1:1056] 1527 1525 1543 1584 1621 ...
## $ AGE2024_MALE : num [1:1056] 751 749 759 778 787 765 770 786 737 709 ...
## $ AGE2024_FEM : num [1:1056] 776 776 784 806 834 808 810 758 731 723 ...
## $ AGE2529_TOT : num [1:1056] 1503 1504 1498 1460 1477 ...
## $ AGE2529_MALE : num [1:1056] 720 720 718 710 741 714 715 746 783 800 ...
## $ AGE2529_FEM : num [1:1056] 783 784 780 750 736 742 742 798 802 823 ...
## $ AGE3034_TOT : num [1:1056] 1649 1652 1664 1651 1595 ...
## $ AGE3034_MALE : num [1:1056] 805 808 816 821 785 763 745 716 712 701 ...
## $ AGE3034_FEM : num [1:1056] 844 844 848 830 810 810 800 751 723 710 ...
## $ AGE3539_TOT : num [1:1056] 1868 1866 1847 1755 1705 ...
## $ AGE3539_MALE : num [1:1056] 940 938 923 861 829 834 825 792 801 781 ...
## $ AGE3539_FEM : num [1:1056] 928 928 924 894 876 848 839 826 825 790 ...
## $ AGE4044_TOT : num [1:1056] 1960 1957 1960 1990 1984 ...
## $ AGE4044_MALE : num [1:1056] 994 994 996 1016 1010 ...
## $ AGE4044_FEM : num [1:1056] 966 963 964 974 974 963 947 928 920 905 ...
## $ AGE4549_TOT : num [1:1056] 2094 2094 2088 2031 1952 ...
## $ AGE4549_MALE : num [1:1056] 1019 1019 1012 986 955 ...
## $ AGE4549_FEM : num [1:1056] 1075 1075 1076 1045 997 ...
## $ AGE5054_TOT : num [1:1056] 2170 2170 2159 2146 2131 ...
## $ AGE5054_MALE : num [1:1056] 1083 1082 1075 1055 1046 ...
## $ AGE5054_FEM : num [1:1056] 1087 1088 1084 1091 1085 ...
## $ AGE5559_TOT : num [1:1056] 1977 1978 1994 2024 2045 ...
## $ AGE5559_MALE : num [1:1056] 1001 1001 1007 1021 1038 ...
## $ AGE5559_FEM : num [1:1056] 976 977 987 1003 1007 ...
## $ AGE6064_TOT : num [1:1056] 1704 1703 1707 1775 1771 ...
## $ AGE6064_MALE : num [1:1056] 875 873 875 908 897 916 928 947 974 980 ...
## $ AGE6064_FEM : num [1:1056] 829 830 832 867 874 896 905 945 966 981 ...
## $ AGE6569_TOT : num [1:1056] 1408 1406 1388 1366 1463 ...
## $ AGE6569_MALE : num [1:1056] 679 679 672 669 731 759 795 817 850 830 ...
## $ AGE6569_FEM : num [1:1056] 729 727 716 697 732 714 743 754 776 804 ...
## $ AGE7074_TOT : num [1:1056] 1043 1043 1054 1107 1137 ...
## $ AGE7074_MALE : num [1:1056] 517 517 523 552 549 561 575 574 557 622 ...
## $ AGE7074_FEM : num [1:1056] 526 526 531 555 588 632 640 618 604 622 ...
## $ AGE7579_TOT : num [1:1056] 783 784 782 801 817 821 828 857 890 903 ...
## $ AGE7579_MALE : num [1:1056] 350 351 352 368 379 384 381 395 417 411 ...
## $ AGE7579_FEM : num [1:1056] 433 433 430 433 438 437 447 462 473 492 ...
## $ AGE8084_TOT : num [1:1056] 548 548 552 556 547 574 568 566 569 596 ...
## $ AGE8084_MALE : num [1:1056] 241 241 243 233 222 237 230 229 234 243 ...
## $ AGE8084_FEM : num [1:1056] 307 307 309 323 325 337 338 337 335 353 ...
## $ AGE85PLUS_TOT : num [1:1056] 448 448 451 455 453 442 462 476 488 477 ...
## $ AGE85PLUS_MALE : num [1:1056] 151 151 152 155 167 166 178 177 172 171 ...
## $ AGE85PLUS_FEM : num [1:1056] 297 297 299 300 286 276 284 299 316 306 ...
## $ MEDIAN_AGE_TOT : num [1:1056] 39.6 39.6 39.7 40.1 40.4 40.9 40.9 41.2 41.4 41.6 ...
## $ MEDIAN_AGE_MALE: num [1:1056] 39.3 39.3 39.3 39.7 40.1 40.5 40.5 40.8 40.8 40.8 ...
## $ MEDIAN_AGE_FEM : num [1:1056] 40 40 40 40.4 40.8 41.2 41.4 41.6 41.9 42.4 ...
## - attr(*, "spec")=
## .. cols(
## .. SUMLEV = col_character(),
## .. STATE = col_double(),
## .. COUNTY = col_character(),
## .. STNAME = col_character(),
## .. CTYNAME = col_character(),
## .. YEAR = col_double(),
## .. POPESTIMATE = col_double(),
## .. POPEST_MALE = col_double(),
## .. POPEST_FEM = col_double(),
## .. UNDER5_TOT = col_double(),
## .. UNDER5_MALE = col_double(),
## .. UNDER5_FEM = col_double(),
## .. AGE513_TOT = col_double(),
## .. AGE513_MALE = col_double(),
## .. AGE513_FEM = col_double(),
## .. AGE1417_TOT = col_double(),
## .. AGE1417_MALE = col_double(),
## .. AGE1417_FEM = col_double(),
## .. AGE1824_TOT = col_double(),
## .. AGE1824_MALE = col_double(),
## .. AGE1824_FEM = col_double(),
## .. AGE16PLUS_TOT = col_double(),
## .. AGE16PLUS_MALE = col_double(),
## .. AGE16PLUS_FEM = col_double(),
## .. AGE18PLUS_TOT = col_double(),
## .. AGE18PLUS_MALE = col_double(),
## .. AGE18PLUS_FEM = col_double(),
## .. AGE1544_TOT = col_double(),
## .. AGE1544_MALE = col_double(),
## .. AGE1544_FEM = col_double(),
## .. AGE2544_TOT = col_double(),
## .. AGE2544_MALE = col_double(),
## .. AGE2544_FEM = col_double(),
## .. AGE4564_TOT = col_double(),
## .. AGE4564_MALE = col_double(),
## .. AGE4564_FEM = col_double(),
## .. AGE65PLUS_TOT = col_double(),
## .. AGE65PLUS_MALE = col_double(),
## .. AGE65PLUS_FEM = col_double(),
## .. AGE04_TOT = col_double(),
## .. AGE04_MALE = col_double(),
## .. AGE04_FEM = col_double(),
## .. AGE59_TOT = col_double(),
## .. AGE59_MALE = col_double(),
## .. AGE59_FEM = col_double(),
## .. AGE1014_TOT = col_double(),
## .. AGE1014_MALE = col_double(),
## .. AGE1014_FEM = col_double(),
## .. AGE1519_TOT = col_double(),
## .. AGE1519_MALE = col_double(),
## .. AGE1519_FEM = col_double(),
## .. AGE2024_TOT = col_double(),
## .. AGE2024_MALE = col_double(),
## .. AGE2024_FEM = col_double(),
## .. AGE2529_TOT = col_double(),
## .. AGE2529_MALE = col_double(),
## .. AGE2529_FEM = col_double(),
## .. AGE3034_TOT = col_double(),
## .. AGE3034_MALE = col_double(),
## .. AGE3034_FEM = col_double(),
## .. AGE3539_TOT = col_double(),
## .. AGE3539_MALE = col_double(),
## .. AGE3539_FEM = col_double(),
## .. AGE4044_TOT = col_double(),
## .. AGE4044_MALE = col_double(),
## .. AGE4044_FEM = col_double(),
## .. AGE4549_TOT = col_double(),
## .. AGE4549_MALE = col_double(),
## .. AGE4549_FEM = col_double(),
## .. AGE5054_TOT = col_double(),
## .. AGE5054_MALE = col_double(),
## .. AGE5054_FEM = col_double(),
## .. AGE5559_TOT = col_double(),
## .. AGE5559_MALE = col_double(),
## .. AGE5559_FEM = col_double(),
## .. AGE6064_TOT = col_double(),
## .. AGE6064_MALE = col_double(),
## .. AGE6064_FEM = col_double(),
## .. AGE6569_TOT = col_double(),
## .. AGE6569_MALE = col_double(),
## .. AGE6569_FEM = col_double(),
## .. AGE7074_TOT = col_double(),
## .. AGE7074_MALE = col_double(),
## .. AGE7074_FEM = col_double(),
## .. AGE7579_TOT = col_double(),
## .. AGE7579_MALE = col_double(),
## .. AGE7579_FEM = col_double(),
## .. AGE8084_TOT = col_double(),
## .. AGE8084_MALE = col_double(),
## .. AGE8084_FEM = col_double(),
## .. AGE85PLUS_TOT = col_double(),
## .. AGE85PLUS_MALE = col_double(),
## .. AGE85PLUS_FEM = col_double(),
## .. MEDIAN_AGE_TOT = col_double(),
## .. MEDIAN_AGE_MALE = col_double(),
## .. MEDIAN_AGE_FEM = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
3.2 Population by county final data frame
populationOHv2 <- populationOH %>%
  filter(YEAR == 12) %>% #get only rows with YEAR 12, which is the estimated 2021 data
  select(STNAME, CTYNAME, YEAR, POPESTIMATE) %>% #extract only these columns
  mutate(CTYNAME = str_replace(CTYNAME, " County", "")) %>% #take off " County"
  mutate(CTYNAME = tolower(CTYNAME)) %>% #convert to lowercase
  rename(subregion = CTYNAME) %>% #rename the county column to subregion to match the map data
  rename(population = POPESTIMATE) #rename the population column
head(populationOHv2)
## # A tibble: 6 × 4
## STNAME subregion YEAR population
## <chr> <chr> <dbl> <dbl>
## 1 Ohio adams 12 27698
## 2 Ohio allen 12 102351
## 3 Ohio ashland 12 53484
## 4 Ohio ashtabula 12 97241
## 5 Ohio athens 12 65327
## 6 Ohio auglaize 12 45656
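The county-name cleanup (strip " County", then lowercase) can be spot-checked on sample values; here using base R's sub() as an equivalent to str_replace():

```r
ctynames <- c("Adams County", "Allen County")

# Remove the literal " County" suffix, then lowercase
cleaned <- tolower(sub(" County", "", ctynames, fixed = TRUE))
cleaned # "adams" "allen"
```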
3.2.a Check for NA’s in rows
Skimming through the table, there are no NA values.
4. Visualizations
4.1 Choose population ranges for map index
I want to choose the population ranges for the map legend. I’m cheating a bit here: I’m backtracking after looking ahead. I made the map first and saw that the default scale didn’t show enough variation in the population, so I am going to build my own buckets in a new column. I tried to do this after joining the data, but the number of rows increases during the join, so I want to bucket the data while it still has one row per county.
To find the right buckets I’m going to look at a histogram of the population values.
# code for histogram in base r, but I want to manipulate the bin width, so I decided to go with ggplot2
# hist(populationOHv2$population)
ggplot(populationOHv2, aes(x=population)) +
geom_histogram(binwidth=25000) +
scale_x_continuous(labels = scales::comma) #this is to get rid of scientific notation in the x axis
From this histogram it looks like 0-100,000 is where most counties fall. Given this, I want to split that range into 25k chunks (i.e. 0-24,999; 25k-49,999; etc.).
The last bucket is a catch-all for anything above, since you can hover over a county to see its exact population.
4.1.2 Create a new column with these buckets
I used dplyr's mutate() with base R's cut() function. Note that cut() needs exactly one fewer label than breaks.
populationOHv3 <- populationOHv2 %>%
  mutate(populationBuckets = cut(population, #use the column directly inside mutate
    breaks = c(0, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000, 250000, 500000, 1000000, 1500000, 2000000),
    include.lowest = T, #min value is included in the interval
    right = F, #intervals are left-closed, keeping the min value
    labels = c('0-24,999', '25k-49,999', '50k-74,999', '75k-99,999', '100k-124,999', '125k-149,999', '150k-174,999', '175k-199,999', '200k-249,999', '250k-499,999', '500k-999,999', '1M-1.49M', '1.5M-2M')))
head(populationOHv3) #printing the first 6 rows to check the new column
## # A tibble: 6 × 5
## STNAME subregion YEAR population populationBuckets
## <chr> <chr> <dbl> <dbl> <fct>
## 1 Ohio adams 12 27698 25k-49,999
## 2 Ohio allen 12 102351 100k-124,999
## 3 Ohio ashland 12 53484 50k-74,999
## 4 Ohio ashtabula 12 97241 75k-99,999
## 5 Ohio athens 12 65327 50k-74,999
## 6 Ohio auglaize 12 45656 25k-49,999
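The breaks/labels relationship in cut() (one fewer label than breaks; right = FALSE makes the intervals left-closed) can be verified on a small vector:

```r
x <- c(10000, 60000, 130000)

buckets <- cut(x,
               breaks = c(0, 25000, 100000, 2000000),         # 4 breaks...
               labels = c("0-24,999", "25k-99,999", "100k+"), # ...3 labels
               include.lowest = TRUE, right = FALSE)

as.character(buckets) # "0-24,999" "25k-99,999" "100k+"
```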
4.2 Population of Ohio
Next I need the county boundaries from the maps package, which provides latitude and longitude outlines for each county, so that each county can be drawn with its population data.
countiesOH <- map_data("county") %>%
filter(region == "ohio") %>% #only pull ohio data
left_join(populationOHv3, by = 'subregion') #join the census data with the county map data
head(countiesOH)
## long lat group order region subregion STNAME YEAR population
## 1 -83.66902 39.02989 2012 59960 ohio adams Ohio 12 27698
## 2 -83.56590 39.02989 2012 59961 ohio adams Ohio 12 27698
## 3 -83.37109 39.06426 2012 59962 ohio adams Ohio 12 27698
## 4 -83.30806 39.06426 2012 59963 ohio adams Ohio 12 27698
## 5 -83.30233 39.05280 2012 59964 ohio adams Ohio 12 27698
## 6 -83.25649 39.01842 2012 59965 ohio adams Ohio 12 27698
## populationBuckets
## 1 25k-49,999
## 2 25k-49,999
## 3 25k-49,999
## 4 25k-49,999
## 5 25k-49,999
## 6 25k-49,999
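The row multiplication mentioned in section 4.1 is just left_join() repeating each county's single population row for every boundary point; a toy sketch:

```r
library(dplyr)

# Two boundary points for adams, one for allen
boundary <- tibble(subregion = c("adams", "adams", "allen"),
                   long      = c(-83.67, -83.57, -84.11))

pop <- tibble(subregion  = c("adams", "allen"),
              population = c(27698, 102351))

joined <- left_join(boundary, pop, by = "subregion")
joined # still 3 rows; adams' population appears on both of its points
```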
As you can see, the dataset is ready to be mapped.
mapOH <-
ggplot(data = countiesOH, aes(x = long, y = lat,
text = paste("County: ", subregion,
"<br>Population: ", round(population/1000,digits =1),"k"
)
)) + #for hover text using plotly
geom_polygon(aes(group = group, fill = populationBuckets)) + #group by the map's polygon id so each county is drawn as its own shape
theme_dark() +
scale_fill_manual(name = "Population Range", values = c('0-24,999'= '#ffffff',
'25k-49,999'= '#f7fcf5' ,
'50k-74,999'= '#e5f5e0',
'75k-99,999'= '#c7e9c0' ,
'100k-124,999'= '#a1d99b',
'125k-149,999'= '#74c476',
'150k-174,999'= '#41ab5d',
'175k-199,999'= '#238b45',
'200k-249,999'= '#006d2c',
'250k-499,999'= '#00552a', #was a duplicate of the 175k color; this keeps the ramp getting darker
'500k-999,999'= '#00441b',
'1M-1.49M'= '#252525',
'1.5M-2M' = '#000000' ))
#When I tried RColorBrewer it stopped at 9 colors, so I added the extra colors manually
mapOH
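An alternative to picking the extra hex codes by hand: interpolate between the lightest and darkest greens with base R's colorRampPalette(), which returns a function that generates any number of evenly spaced colors:

```r
# 13 evenly spaced colors from the lightest to the darkest green above
pal <- colorRampPalette(c("#f7fcf5", "#00441b"))(13)

length(pal) # 13
pal[1]      # "#F7FCF5"
```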
TIP: use a continuous fill scale when you are not separating your numbers into buckets (i.e. if I had used the raw population column). Once bucketed, the values are factors, so you use a discrete manual fill for color.
4.3 Map of Starbucks locations in Ohio
Here I add the Starbucks locations in Ohio on top of the population map.
mapOH2 <- mapOH +
ggtitle("Ohio Starbucks Locations and Population") +
geom_point(data=starbucksOH, aes(x=longitude, y=latitude,
text = paste("Store Number: ", storeNumber) #hover text for each dot
))
ggplotly(mapOH2, tooltip = "text")
It looks like population size affects where Starbucks locations are in Ohio.
5. Challenges and Final Thoughts
5.1 What took the most time
“3.2 Population by county final data frame” took me the most time. Why?
- Searching the internet turned up a lot of base R functions to clean the data. I had to sleep on it and realize I should find dplyr functions instead.
- I ran into a snag when I mixed base R and dplyr functions, because my data would no longer be a data frame after running some base R functions. First mutate() gave me an error, and after searching Google I realized my data was no longer a data frame, which I confirmed with is.data.frame(dataset).
- Understanding which functions affect rows (individual vectors) vs columns was a skill I honed here. This Column-wise Operations article was very helpful.
5.2 Final thoughts
Most of the time definitely went into getting the data to be in the right format for the visualization.